Final  Report 


Title:  Chance  Discovery  with  Data  Crystallization 

Discovering  Unobservable  Events 


AFOSR/AOARD  Reference  Number:  AOARD-06-4062 
AFOSR/AOARD  Program  Manager:  Hiroshi  Motoda,  Ph.D. 


Period  of  Performance:  01  Sept  2006  -  1  Sept  2007 


Submission  Date:  19  November  2007 


PI:  Katsumi  Nitta  /  Tokyo  Institute  of  Technology 


Report  Documentation  Page 


Form  Approved 
OMB  No.  0704-0188 


Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 
VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 


1.  REPORT  DATE 

12  APR  2007 


2.  REPORT  TYPE 


3.  DATES  COVERED 


4. TITLE AND SUBTITLE

Chance Discovery with Data Crystallization - Discovering Unobservable Events

5a. CONTRACT NUMBER

FA48690610120

5b. GRANT NUMBER

5c. PROGRAM ELEMENT NUMBER

5d. PROJECT NUMBER

5e. TASK NUMBER

5f. WORK UNIT NUMBER

6. AUTHOR(S)

Katsumi Nitta

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, JP, 176-0024

8. PERFORMING ORGANIZATION REPORT NUMBER

N/A

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

10. SPONSOR/MONITOR'S ACRONYM(S)

11. SPONSOR/MONITOR'S REPORT NUMBER(S)

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited. 

13.  SUPPLEMENTARY  NOTES 

14.  ABSTRACT 

The method of data crystallizing reveals hidden structure by inserting dummy items, corresponding to
unobservable (i.e., hidden) events, into the given incomplete and ill-structured data on past events. The
existence of those hidden events and their location in the environment were visualized as a result of data
crystallization. The method was evaluated by applying it to 1) simulated data using the 9/11 terrorist
network and 2) test data provided by Dr. Bob Schrag via Dr. Tae-Woo Park. It was further applied to two
real business domains: 1) redesigning a surface inspection system (SIS), and 2) extracting the essence of the
flow of arguments in the negotiation logs of two hypothetical companies with respect to three business proposals.


15.  SUBJECT  TERMS 


16.  SECURITY  CLASSIFICATION  OF: 


a.  REPORT 

unclassified 


b.  ABSTRACT 

unclassified 


c.  THIS  PAGE 

unclassified 


17. LIMITATION OF ABSTRACT

18. NUMBER OF PAGES

51

19a. NAME OF RESPONSIBLE PERSON

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


(2)  Objectives:  Briefly  summarize  the  objectives  of  the  research  effort  or  the  statement  of  work. 

Only the observable part of the real world can be represented in data. From such scattered (i.e.,
incomplete and ill-structured) data, the method of data crystallizing presented by Ohsawa in a preceding project
(AOARD-05-15) revealed the hidden structure by inserting dummy items, corresponding to unobservable (i.e.,
hidden) events, into the given data on past events. The existence of those hidden events and their position in the
environment were visualized as a result of data crystallizing. This year, the basic method has been extended to
be applicable to various real-world domains, such as intelligence analysis of terrorist networks and product
development, and to noticing essential hidden assertions in disputes. In this project, the researchers led by
the new PI, Nitta, have been developing a human-centric process for improving the performance of data crystallization
by inventing (a) a new tool extending KeyGraph, and (b) a process that involves human interpretation and
iterative manipulation of the visualized graph. The planned experiments in the proposal were to analyze:

Artificial  data  obtained  from  simulating  the  target  of  intelligence  analysis,  i.e.,  organized  crimes. 

Other kinds of data that match emerging social interests, e.g., real text of conversations in
court, and data on people’s movements in town.

(3)  Status  of  effort:  A  brief  statement  of  progress  towards  achieving  the  research  objectives.  (Limit  this 
section  to  about  200  words  or  less.) 

The basic algorithm of data crystallizing was realized in AOARD-054016, to visualize unobservable
events and their relations with other events. This year, the new PI, Katsumi Nitta, took over that work in order to make
data crystallization a more widely applicable method for visualizing hidden events in the real
world. Nitta has been progressing this year with the assistance of Yoshiharu Maeno, a PhD candidate under Ohsawa’s
supervision. This project finally established a method to discover a node that is significantly relevant to others in
a complex social network but missing from the data. The problem was difficult, not only because such a node
(corresponding to the leader of a terrorist group) appears infrequently and non-routinely, but also because logs of
analysts’ surveillance of covert social networks are hardly available. We invented a method for integrating the
investigator’s prior understanding, insight into the target social network, and computational data processing. We
evaluated the method by applying it to simulated data using the 9/11 terrorist network and to test data provided by Dr.
Bob Schrag via Dr. Tae-Woo Park. We also applied the method to business problems, studying how new ideas are
invented in corporate research and development.

(4)  Abstract:  Briefly  describe  research  accomplishments,  their  significance  to  the  field,  and  their  relationship  to 
the  original  goals. 

Accomplishments 

a. Stage 1) Development of basic tool: For a scattered, i.e., incomplete and ill-structured, dataset, we realized a
tool for data crystallizing that inserts dummy items corresponding to unobservable events. The existence of
these unobservable events and their relations with other events are visualized by applying KeyGraph iteratively
to the data with the dummy items inserted, gradually increasing the number of edges in the graph, like the
crystallization of snow as the air temperature gradually decreases. For tuning the granularity of the
structure to be visualized, this tool is integrated with the human process of chance discovery. This basic method
proved applicable to the discovery of hidden leaders of meetings, i.e., managers who do not appear
in the meeting room but send commands to the members who do.
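
The dummy-item idea in Stage 1 can be illustrated with a minimal Python sketch. It is not the project's tool and it does not implement KeyGraph: plain co-occurrence counting stands in for KeyGraph here, and the function names, the `DUMMY_i` labels, and the toy baskets are all hypothetical. Each record (basket) of co-occurring events receives one unique dummy item, and the graph is then drawn with a gradually increasing edge budget.

```python
from collections import Counter
from itertools import combinations


def insert_dummies(baskets):
    """Return a copy of the baskets with one unique dummy item appended to each."""
    return [list(basket) + [f"DUMMY_{i}"] for i, basket in enumerate(baskets)]


def cooccurrence(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts


def top_edges(counts, n_edges):
    """Keep the n_edges most frequent pairs (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pair for pair, _ in ranked[:n_edges]]


# Toy data (assumption): two tight groups and one record bridging them.
baskets = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"], ["b", "c"]]
counts = cooccurrence(insert_dummies(baskets))
for n_edges in (2, 4, 8):  # gradually admit more edges into the graph
    print(n_edges, top_edges(counts, n_edges))
```

With a small edge budget only the strongest within-group pairs are drawn; as the budget grows, weaker pairs, including those attached to dummy items, are admitted, and a dummy item that ends up touching two groups marks a candidate position for a hidden event.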

b. Stage 2) Refinement of the method by weighing the human’s role in the process of discovery: We addressed hidden
structure visualization that adapts to the human’s prior understanding. The visualization can be adjusted based on the
degree of the user’s prior understanding of the problem domain. This degree is represented by a temperature
parameter used in human-interactive annealing together with a stable deterministic crystallization algorithm.
When the understanding of the problem is believed to be richer, the temperature is set higher, so that more
complex higher-order hidden structures are revealed; this leads to the discovery of unique and
unexpected scenarios. On the other hand, when the understanding is poorer, the temperature is set lower, and
the user should try to understand the basic lower-order structures from the event graph. This adaptive nature is
convenient for discovering unexpected scenarios from the individual user’s own perspective. The adaptive nature of
the annealing process was demonstrated on examples of social network visualization from: (1) test data
generated from a scale-free network, resulting in a discovery precision of up to 90%; (2) real on-line
communication where people met for group decisions, resulting in the precise discovery of real leaders who had
been deleted from the communication data; and (3) data on persons related to famous politicians.


c. Stage 3) Extension of the algorithm and the process, and evaluations for simulated datasets: The extension is
an interactive process starting from the analysts’ surveillance, where hypotheses on the latent structure are
discussed. The algorithm, in the case of terrorist analysis, visualizes the data on the terrorists’ communication and
shows a social network diagram to the analysts. It consists of a clustering procedure and a ranking procedure. The clustering
procedure computes and visualizes the activeness of communication between the terrorists, and uses the
analysts’ prior knowledge, such as the normal number of groups or the known group leaders. The ranking
procedure computes the likeliness of inter-cluster relationships, which originate in an unobserved person
hidden in the empty spots between the clusters, and indicates the position of that person as a red node. The
investigators compare the visualized social network diagram with their prior knowledge, update their prior
understanding, iterate the above procedures, and finally invent a hypothesis on the latent structure (Maeno,
2007). The details of the method are presented in attachment I. The two figures, in comparison, show an
example of the results. Other results in the attachment show the high accuracy of detecting hidden leaders
in our application to the AOARD test data.


Figure A (corresponding to Fig. 4 in attachment I): Social network diagram representing the observed 19 hijackers responsible for
the 9/11 attack in figure 3 with the revealed 18 covert conspirators (Krebs, 2002).

Figure B (corresponding to Fig. 10 in attachment I): 4 clusters and 10 highly ranked red nodes corresponding to Mustafa A.
Al-Hisawi hidden in the suspicious records. Waleed Alshehri and Mohand Alshehri are retrieved as neighbor persons of red nodes.
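
The clustering and ranking procedures of Stage 3 can be sketched in Python as follows. This is a simplified illustration only: the actual procedures are specified in attachment I and are not reproduced here. The sketch assumes that clusters are connected components of a thresholded co-occurrence graph and that a record is "suspicious" when it touches two or more clusters; the function names, the `min_count` threshold, and the toy records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def clusters_from_records(records, min_count=2):
    """Map each person to a cluster label.

    Clusters are the connected components of the graph whose edges are
    pairs of people seen together in at least min_count records.
    """
    pair_counts = Counter()
    people = set()
    for rec in records:
        people.update(rec)
        for a, b in combinations(sorted(set(rec)), 2):
            pair_counts[(a, b)] += 1

    parent = {p: p for p in people}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), n in pair_counts.items():
        if n >= min_count:
            parent[find(a)] = find(b)
    return {p: find(p) for p in people}


def rank_records(records, cluster_of):
    """Return indices of records touching 2+ clusters, most clusters first."""
    scores = [(len({cluster_of[p] for p in rec}), i) for i, rec in enumerate(records)]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ranked if score >= 2]


# Toy communication records (assumption): two groups plus two bridging records.
records = [["A", "B"], ["A", "B"], ["C", "D"], ["C", "D"], ["B", "C"], ["A", "D"]]
cluster_of = clusters_from_records(records)
print(rank_records(records, cluster_of))
```

In this simplification, each highly ranked record is where a dummy "red node" would be placed, and the people in that record are its neighbors; the analyst would then compare such positions with prior knowledge and repeat the procedure with updated settings.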


d. Stage 4) Application to business problems: The method has been applied to two real business domains. (1)
Redesigning a surface inspection system (SIS), a machine for detecting defects on charge-coupled devices (CCDs).
(2) Extracting the characteristics of each group and understanding the flow of their particular arguments: so far this
has been done by humans and has taken a long time, so we aimed to employ the assistance of a computer to
understand the outline of an argument and to obtain materials for teaching reasoning skills in mediation or
negotiation. We therefore applied KeyGraph and Data Crystallization technologies to attempt an
analysis of long texts through collaboration between humans and computer-based systems. Several researchers have
already applied these technologies to analyze speech texts. However, our text logs have the following two features:
(1) we have several argumentation logs regarding the same issues, and (2) we can estimate the important issues before
the moot arbitration or moot negotiation by analyzing the case. We extended KeyGraph and the Data
Crystallization method to exploit these features of our argumentation logs, and achieved our aim.

Significance  to  the  field 

The basis of this study has been chance discovery, which means discovering a chance, defined as an event
significant for making a decision. Using existing data in business and in the natural and social sciences, we have been
achieving successful chance discoveries in various domains, including (but not restricted to):

Marketing, where consumer behaviors arising from hidden motivations are dealt with,

Prediction of earthquakes caused by hidden active faults,

Hepatitis treatment, where some observations might be missing from the blood test.

In studies on chance discovery, we have been successful in finding rare but significant events. Data
crystallizing extends chance discovery to the discovery of significant events that have never
occurred in the given data, i.e., from low frequency to zero frequency. This means dealing with a more uncertain
environment, where humans may miss important events, than we have been dealing with in data mining or
chance discovery. The presented study can be compared with previous studies in existing domains as follows:

1) Empirical studies on terrorist/criminal social networks: Batallas [Batallas, 2006] applied
centrality [Freeman, 1979] and brokerage [Cusumano, 2000] to analyzing an aircraft engine development
project, and studied the influence of an information leader team, which could be either a bottleneck of
information flow or an innovation diffuser. [Keila, 2006] applied factor analysis to studying email exchanges
at Enron, which went bankrupt due to its institutionalized accounting fraud. [Klerks, 2002] pointed out
that criminal organizations tend to be strings of inter-linked small groups that lack a central leader but
coordinate their activities along logistic trails and through bonds of friendship, and that hypotheses can be built by
paying attention to remarkable white spots and hard-to-fill positions in a network. Krebs [Krebs, 2002]
investigated the 9/11 terrorist network, and revealed the relevance of conspirators who reduce the distance
between hijackers and enhance communication efficiency. Then, Morselli [Morselli, 2007] investigated Krebs’s
network from the viewpoint of the efficiency-security trade-off, and suggested that a more security-oriented
structure arises from the longer time-to-task of the terrorists’ objectives, and that conspirators improve
communication efficiency while preserving the hijackers’ low visibility and exposure. These methods search for
objectively important members and the links between them in the community. The human-centric process of
data crystallization, on the other hand, visualizes candidate positions of unobservable leaders in the network,
and aims at accelerating the subjective interpretation of analysts, based on their knowledge.

2) Theories and computational models of complex networks: Complex network theory, graph theory, and
machine learning algorithms help us obtain insight into the dynamics of a social network, in addition to
summarizing and visualizing a network [Shen, 2007] and analyzing a cognitive network [Krackhardt, 1987].
Scale-free networks [Barabasi, 1999] and small worlds [Watts, 1998] give us much insight into the structure
and evolution of social networks such as scientists’ collaborations and actors in movies. A power law in the nodal
degree distribution governs the scale-free network. Fenner [Fenner, 2007] proposed an exponential cutoff
mechanism to modify the power law. Error and attack tolerance [Albert, 2000] and search efficiency [Adamic,
2001] are of particular interest for practical applications. These have been the basis of the studies in 1) above, and
also of the studies on the visualization part of data crystallization.
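
For readers unfamiliar with the terminology in 2), the generic textbook forms of a power-law degree distribution and of an exponentially truncated power law are written out below. The symbols gamma (the exponent) and k_c (the cutoff scale) are notational assumptions introduced here, and the exact model of [Fenner, 2007] is not reproduced.

```latex
% Pure power law for the nodal degree k (scale-free network):
P(k) \propto k^{-\gamma}, \qquad \gamma > 0

% Generic exponentially truncated form (k_c is an assumed cutoff scale):
P(k) \propto k^{-\gamma} \, e^{-k/k_c}

% For k \ll k_c the cutoff factor is close to 1 and the power law is recovered;
% for k \gg k_c the exponential factor suppresses very high-degree nodes.
```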

3) Evidence extraction and link discovery: A research area relevant to Chance Discovery, in which
important links of people with other people and with their own actions are to be discovered from
heterogeneous sources of data, is Evidence Extraction and Link Discovery (EELD). The difference between
Chance Discovery and EELD, at the time we began this project, was in the position of human factors in the
research approaches. In Chance Discovery, visualization techniques such as KeyGraph have been used for
clarifying the effect of chances by reinforcing the user’s thoughts on scenarios in the real environment. On the
other hand, the EELD program mainly contributed to identifying the most significant links among items more
automatically and precisely than humans can. After this successful one-year project, we showed that an
improvement of the visualization tool reinforces the process of chance discovery, and this may be regarded as
a new feature of the state of chance discovery.

Link discovery has been applied to predicting collaboration between scientists from published
co-authorship [Liben-Nowell, 2004]. Adamic [Adamic, 2003] proposed a technique to infer friends and neighbors
from the information available on the web. [Singh, 2004] applied a hidden Markov model and a Bayesian
network to predict the behavior of terrorists. Learning of Bayesian networks has been extended to study the
probabilistic nature of latent variables. Silva [Silva, 2006] studied learning the structure of a linear latent
variable graph. Friedman [Friedman, 1998] studied learning the structure of a dynamic probabilistic network.
Such principled analytic approaches often suffer from a complexity problem. The complexity includes
bidirectional and cyclic influence among the many observed and latent nodes (beyond a triad: 1 latent node
influencing 2 observed nodes).


Relation  to  the  goal 

The sphere of real-world applications linked to this basic research is expected to include intelligence
analysis aiming to arrest unknown leaders, development of ideas about new (unknown) products,
understanding the hidden but important issues in arbitration or negotiation, etc. We successfully
demonstrated the ability of our methods to solve these new problems by applying them to simulated complex problems and
to scaled-down, simplified versions of these real, up-to-date problems.

Appendix I: An Analysis of Terrorist Organization

Appendix  II:  Corporate  R&D  Projects  aided  by  Data  Crystallization 